Efficiency of Data Structures for Detecting Overlaps in Digital Documents

Authors

  • Krisztián Monostori
  • Arkady B. Zaslavsky
  • Heinz W. Schmidt
Abstract

This paper analyses the efficiency of different data structures for detecting overlap in digital documents. Most existing approaches use a hash function to reduce the space requirements of their chunk indices. Since a hash function can produce the same value for different chunks, false matches are possible. In this paper we propose an algorithm for eliminating those false matches. The algorithm uses a suffix tree, which is space-consuming. We define a modified suffix tree that considers only chunks starting at the beginning of words, and we show how the algorithm can work on this structure. Alternatively, the space requirements of a suffix tree can be reduced by converting it to a directed acyclic graph. We show that suffix link information can be preserved in this new structure and that the matching statistics algorithm still works with the modifications we propose.
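The idea of indexing only chunks that start at word boundaries and then computing matching statistics against that index can be illustrated with a minimal sketch. It is not the authors' suffix tree or directed acyclic graph: a sorted list of word-start suffixes stands in for the tree, there are no suffix links, and construction is far slower than a real suffix tree. All function names (`word_starts`, `build_word_suffix_array`, `longest_match`, `matching_statistics`) are hypothetical and chosen only for this illustration.

```python
def word_starts(text):
    """Indices where a word begins (start of text or right after whitespace)."""
    return [i for i, ch in enumerate(text)
            if not ch.isspace() and (i == 0 or text[i - 1].isspace())]


def build_word_suffix_array(text):
    """Sort only the suffixes that begin at a word start.

    Assumption: a plain sorted list replaces the paper's modified suffix tree.
    """
    return sorted(word_starts(text), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_match(text, sa, query):
    """Length of the longest prefix of query that matches a word-start suffix."""
    # Binary search for the position where query would be inserted.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    # The best match is always one of the two neighbours in sorted order.
    best = 0
    for k in (lo - 1, lo):
        if 0 <= k < len(sa):
            best = max(best, common_prefix_len(text[sa[k]:], query))
    return best


def matching_statistics(source, suspect):
    """For each word start in suspect, the longest exact match found in source."""
    sa = build_word_suffix_array(source)
    return {j: longest_match(source, sa, suspect[j:])
            for j in word_starts(suspect)}


stats = matching_statistics("the quick brown fox jumps", "a quick brown dog")
print(stats)
```

Because every reported length comes from an exact character-by-character comparison, a chunk that merely shares a hash value with another chunk cannot be reported as an overlap, which is the property the abstract attributes to the suffix-tree approach.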


Similar resources

MatchDetectReveal: finding overlapping and similar digital documents

The Internet provides easy access to large collections of semi-structured digital documents. WWW browsers, search engines and the "cut & paste" technique make it tempting to replace one's own creativity with simple compilation from suitable digital resources. This paper discusses the problems of detecting plagiarism in large collections of semi-structured electronic texts. Overlaps in and similarit...


Application of Radon Transform in Detecting Turning Angle of Bodies and in Reading Multi - Lingual Documents

Recently, image processing techniques and robotic vision have been widely applied to fault detection in industrial products as well as to document reading. To compare images captured from the target, it is necessary to prepare a perfect image and then apply matching. Preprocessing must therefore be done to correct for movement of the sample and/or the camera, which can occur during the...


Key Management in Digital Rights Management Systems in Offline Mode

With the expanding use of digital content in information technology, supervision and control over data, as well as preventing the copying of documents, have come under consideration. In this regard, digital rights management systems are responsible for the secure distribution of digital content, and for this purpose they rely on common cryptographic functions and utilize digital watermarkin...


The Relative generality and precision of Evidence Based Medical Information Resources in the Recovery of Diabetes Information

Background and Aim: Relative generality and precision are two important criteria for measuring the efficiency and performance of information retrieval systems. The aim of this study was to compare the integrity and location of evidence-based bases in the digital library of Hamedan University of Medical Sciences in data retrieval of diabetes. Methods: The design of this research is cross-sect...



Publication date: 2001